-
Notifications
You must be signed in to change notification settings - Fork 11
update vcat to mimic Base.vcat and enhance promotion rules of mixed column type #45
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks mostly good. Please copy the explanations to the commit message (BTW, I think by "slurping" you mean "splatting"?).
It's not completely correct to say "The current vcat operates on Arrays of DataTable", since the vararg form was also supported: what you are doing is that you remove the form taking a vector of tables.
allheaders = map(names, dts) | ||
# don't vcat empty DataTables | ||
notempty = find(x -> length(x) > 0, allheaders) | ||
uniqueheaders = unique(allheaders[notempty]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about this: is there a strong reason not to check headers of empty tables? If they don't match, it could indicate a bug in the user code, and I can't find a situation where this would be useful.
Also, it's not correct to return DataTable()
in that case since that doesn't preserve the headers if only empty tables are passed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we agree here, but just to clarify
julia> vcat(DataTable(), DataTable(), DataTable()) #this is correct
0×0 DataTables.DataTable
julia> vcat(DataTable(), DataTable(), DataTable(x=[])) # this should be an error
0×1 DataTables.DataTable
julia> vcat(DataTable(), DataTable(), DataTable(x=[1])) # should also be an error
1×1 DataTables.DataTable
│ Row │ x │
├─────┼───┤
│ 1 │ 1 │
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's what I meant.
estrings = Vector{String}(length(uniqueheaders)) | ||
for (i, u) in enumerate(uniqueheaders) | ||
matchingloci = find(h -> u == h, allheaders) | ||
headerdiff = filter(x -> !in(x, u), coldiff) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just setdiff(coldiff, u)
, right?
for (i, u) in enumerate(uniqueheaders) | ||
matchingloci = find(h -> u == h, allheaders) | ||
headerdiff = filter(x -> !in(x, u), coldiff) | ||
headerdiff = join(string.(headerdiff), ", ", " and ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid reusing variable names for values of different types, this creates a type instability. Same below.
Also string
isn't needed here an below. So you might be able to put the commands directly inside the string literal.
filter!(u -> Set(u) != Set(unionunique), uniqueheaders) | ||
estrings = Vector{String}(length(uniqueheaders)) | ||
for (i, u) in enumerate(uniqueheaders) | ||
matchingloci = find(h -> u == h, allheaders) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is "loci" really useful here?
for i in 1:length(cols) | ||
data = [dt[i] for dt in dts_to_vcat] | ||
res = promote_col_type(data...)(mapreduce(length, +, data)) | ||
cols[i] = copy!(res, vcat(data...)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling vcat
will create a temporary array. Better repeatedly call copy!
with a changing offset (see do
argument).
@@ -193,7 +193,7 @@ combine(map(d -> mean(dropnull(d[:c])), gd)) | |||
""" | |||
function combine(ga::GroupApplied) | |||
gd, vals = ga.gd, ga.vals | |||
valscat = vcat(vals) | |||
valscat = vcat(vals...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this go into a separate commit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Different commit but same PR? I can do that 👍
Thanks for the review! I'll get to these edits soon. re: slurping/splatting, I'm not sure if I used the right one. Both descriptions are used in the docs, but I'll say something more explicit in the commit to avoid any ambiguity. |
Funny, I didn't remember that term. It's fine then. |
That should address all of the comments. I added more tests to assert that the array promotion was working as intended, and a couple tests for vcatting 3 or more datatables to make sure my offset |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, it's hard for me to keep up with the reviews. Should be good to go after this round.
for i in 1:length(cols) | ||
data = [dt[i] for dt in dts] | ||
lens = map(length, data) | ||
cols[i] = promote_col_type(data...)(reduce(+, lens)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just sum(lens)
? Same below.
cols[i] = promote_col_type(data...)(reduce(+, lens)) | ||
copy!(cols[i], data[1]) | ||
for j in 2:length(data) | ||
offset = reduce(+, lens[1:j-1]) + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
offset = lens[1]
for initialization and offset += lens[j]
after each iteration would be simpler.
test/cat.jl
Outdated
@test eltypes(dt)[1] == Any | ||
dt = vcat(DataTable([NullableArray([1])], [:x]), DataTable([[1]], [:x])) | ||
@test dt == DataTable([NullableArray([1, 1])], [:x]) | ||
@test eltypes(dt)[1] == Nullable{Int} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rather test the type of the column, to make sure we get a NullableArray
rather than an Array{Nullable}
? Same below for (Nullable)CategoricalArray
vs. Array{CategoricalValue}`.
test/grouping.jl
Outdated
@test isequal(sum(dt2[:x]), Nullable(0)) | ||
@test size(by(e->1, DataTable(x=Int64[1]), :x)) == (1,2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add this line in particular, and check only its size? It looks out of place in this block. Either remove it or make it more complete in its own block.
No worries! It's not that easy to keep up with the rate at which you do manage to review either. Thanks, as always, for putting in the time to help me improve these PRs. |
Coverage drop doesn't seem related here either. Several functions flagged by Coveralls are handled by #31 |
test/cat.jl
Outdated
@test typeof.(dt.columns) == [NullableVector{Int}] | ||
dt = vcat(DataTable([CategoricalArray([1])], [:x]), DataTable([[1]], [:x])) | ||
@test dt == DataTable([CategoricalArray([1, 1])], [:x]) | ||
@test typeof.(dt.columns) == [CategoricalVector{Int, UInt32}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leave out the UInt32
(or replace it with CategoricalArrays.DefaultRefType
, but probably not worth it).
test/grouping.jl
Outdated
@@ -148,7 +148,6 @@ module TestGrouping | |||
dt2 = by(e->1, DataTable(x=Int64[]), :x) | |||
@test size(dt2) == (0,2) | |||
@test isequal(sum(dt2[:x]), Nullable(0)) | |||
@test size(by(e->1, DataTable(x=Int64[1]), :x)) == (1,2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should put this in the first commit.
This change is helpful to support changing vcat to be more consistent with Base.vcat, where passing an array is not allowed.
This PR removes vcat support for arrays of datatables and makes the Base.vcat style of vcat(args...) the only call option. Removes assumptions for joining datatables with missing, unique, and out of order columns. vcat'ing datatables with unmatched headers results in error messages that explain how the columns are inconsistent. Uses @nalimilan's @generated function to implement a new type of AbstractArray promotion rule that improves handling of NullableArrays and CategoricalArrays. Extends vcat testing.
Thanks! |
The current
vcat
operates on Arrays of DataTablesDataTable[]
, whileBase.vcat
utilizes slurping to capture any number of inputs as a tuple or argument to concatenate. This PR changesvcat
to followBase.vcat
's lead and utilizes a function structure ofvcat(dts::AbstractDataTable...)
. Uses @nalimilan's@generated
function to implement a new type ofAbstractArray
promotion rule that ideally can be tested here and transfered into Base in the near future JuliaLang/julia#20815. This also removes the prior assumptions that were made about how to join datatables that were out of order or had columns that were only present in some of the arguments and not others.